@apappascs
Contributor

This commit introduces the JsoupDocumentReader and JsoupDocumentReaderConfig classes, which provide functionality to read and parse HTML documents using the JSoup library.

The reader supports:

  • Extracting text from specific HTML elements using CSS selectors.
  • Extracting all text from the body of the document.
  • Grouping text by element.
  • Extracting metadata, including the document title, meta tags, and link URLs.
  • Reading from various resource types (files, URLs, byte arrays).
  • Configurable character encoding, selector, separator, and metadata extraction.
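The difference between joining the selected elements into one text and grouping text by element can be sketched in plain Java, without JSoup. This is a minimal illustration only: `ExtractionModesSketch`, `SketchConfig`, and `toDocuments` are hypothetical names, not part of the PR, and the input list stands in for the text of each element that a CSS selector would have matched. It assumes the separator applies only in the joined mode.

```java
import java.util.List;

public class ExtractionModesSketch {

    // Hypothetical stand-in for the reader's configuration.
    // This is NOT the real JsoupDocumentReaderConfig.
    public record SketchConfig(String separator, boolean groupByElement) {}

    // elementTexts: the text of each matched element, in document order
    // (assumed to have been extracted already, e.g. by a CSS selector).
    public static List<String> toDocuments(List<String> elementTexts, SketchConfig config) {
        if (config.groupByElement()) {
            // Grouped mode: one output text per matched element.
            return List.copyOf(elementTexts);
        }
        // Joined mode: a single output text, elements joined by the separator.
        return List.of(String.join(config.separator(), elementTexts));
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of("First paragraph.", "Second paragraph.");
        // Joined mode yields a list with one entry; grouped mode yields two.
        System.out.println(toDocuments(paragraphs, new SketchConfig("\n", false)).size());
        System.out.println(toDocuments(paragraphs, new SketchConfig("\n", true)).size());
    }
}
```

Grouped output suits downstream chunking where each element should become its own document, while joined output keeps the page as a single unit.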

This new reader enhances Spring AI's ability to process web content and other HTML-based data sources.

Signed-off-by: Alexandros Pappas <[email protected]>
@apappascs force-pushed the feature/jsoup-html-reader branch from 2879e6c to c0ef4ac on February 14, 2025 15:55
@ilayaperumalg self-assigned this on Mar 10, 2025
@ilayaperumalg added this to the 1.0.0-M7 milestone on Mar 10, 2025
@ilayaperumalg
Member

@apappascs This is a nice addition, thank you for adding it! Rebased and merged as 82b46d2.
